This document describes the results of the multi-omics non-negative matrix factorization (NMF)-based clustering module. For more information about the module, please visit the PANOPLY Wiki page.
The data matrix subjected to NMF analysis contained 39785 features measured across 76 samples. Table 1 summarizes the number of features used in the clustering and their data type(s).
| Type | Number of features |
|---|---|
| CNV | 14578 |
| prot | 7577 |
| pSTY | 3781 |
| RNA | 13849 |
To determine an optimal value k for the number of clusters, a range of k between 2 and 10 was evaluated using several metrics, each based on 50 randomly initialized iterations. The metrics are summarized in Figure 1. The optimal number of clusters is defined as the k between 3 and 10 that maximizes the product of the cophenetic correlation coefficient (coph) and the dispersion coefficient (disp).
Figure 1: Cluster metrics as a function of cluster numbers.
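The selection rule for k can be sketched in a few lines. This is a minimal illustration, not the module's code: the function name `select_k` and the metric values are hypothetical, and the real values are the ones plotted in Figure 1.

```python
# Hypothetical metric values per k; the actual values come from the NMF runs.
metrics = {
    2: {"coph": 0.99, "disp": 0.95},
    3: {"coph": 0.98, "disp": 0.90},
    4: {"coph": 0.95, "disp": 0.80},
    5: {"coph": 0.93, "disp": 0.78},
}


def select_k(metrics, k_min=3, k_max=10):
    """Return the k in [k_min, k_max] that maximizes coph * disp."""
    products = {
        k: m["coph"] * m["disp"]
        for k, m in metrics.items()
        if k_min <= k <= k_max
    }
    if not products:
        raise ValueError("no k available in the requested range")
    return max(products, key=products.get)


best_k = select_k(metrics)  # k=2 is outside the default search range
```

Restricting the search range rather than dropping k=2 from the input keeps all metrics available for plotting while still applying the stated rule.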
The 76 samples were separated into 3 clusters. Table 2 summarizes the number of samples in each cluster.
| Cluster | # samples | # core samples |
|---|---|---|
| C1 | 21 | 17 |
| C2 | 26 | 24 |
| C3 | 29 | 23 |
The heatmap shown in Figure 2 visualizes the meta-feature matrix derived from decomposing the input matrix, normalized per column by the maximum entry. This matrix is one of the main results of NMF because it provides the basis for assigning samples to clusters.
Figure 2: Heatmap depicting the relative contributions of each sample (x-axis) to each cluster (y-axis). Samples are ordered by cluster and cluster membership score in decreasing order.
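As a rough sketch of the normalization described above, the snippet below divides each column (sample) of a small clusters-by-samples matrix by its maximum entry and assigns each sample to the row with the largest coefficient. The matrix values and function names are made up for illustration, and the argmax assignment is an assumption about how the matrix is read, not a statement about the module's exact membership score.

```python
def normalize_columns_by_max(H):
    """Divide each column (sample) of H by its largest entry.

    Assumes H is non-negative with a positive maximum in every column.
    """
    n_rows, n_cols = len(H), len(H[0])
    col_max = [max(H[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[H[i][j] / col_max[j] for j in range(n_cols)] for i in range(n_rows)]


def assign_clusters(H):
    """Assign each sample (column) to the row with the largest coefficient."""
    n_rows, n_cols = len(H), len(H[0])
    return [max(range(n_rows), key=lambda i: H[i][j]) for j in range(n_cols)]


# Toy 3 x 4 matrix: rows are clusters, columns are samples (hypothetical values).
H = [
    [4.0, 0.5, 0.2, 1.0],
    [1.0, 2.0, 0.1, 3.0],
    [0.5, 0.5, 0.8, 0.5],
]
H_norm = normalize_columns_by_max(H)
labels = assign_clusters(H)
```

After normalization the dominant cluster of every sample has the value 1, so the heatmap colors are comparable across samples.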
Table 3 summarizes the results of an overrepresentation analysis of sample metadata terms (e.g. clinical annotation, inferred phenotypes, etc.) in each cluster. Shown are nominal p-values derived from a Fisher’s exact test (p<0.01, 0.01<p<0.02, 0.02<p<0.05). All samples with cluster membership score > 0.5 were used to characterize the clusters.
| Term | C1 | C2 | C3 |
|---|---|---|---|
| PAM50:Basal | 1.0000000 | 0.0000000 | 1.0000000 |
| PAM50:LumB | 1.0000000 | 0.9999685 | 0.0000000 |
| PAM50:LumA | 0.0000001 | 0.9999986 | 0.9079225 |
| ER.Status:Negative | 0.9998154 | 0.0000000 | 1.0000000 |
| ER.Status:Positive | 0.0028403 | 1.0000000 | 0.0000030 |
| PR.Status:Negative | 0.9973536 | 0.0000000 | 0.9999982 |
| PR.Status:Positive | 0.0155967 | 1.0000000 | 0.0000326 |
| TP53.mutation:1 | 0.9997822 | 0.0000027 | 0.9819766 |
| TP53.mutation:0 | 0.0019130 | 0.9999999 | 0.0585794 |
| PIK3CA.mutation:0 | 0.9590769 | 0.0294839 | 0.8605083 |
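To make the overrepresentation test concrete, here is a minimal sketch of a one-sided Fisher's exact test computed from the hypergeometric tail. It assumes the test is one-sided for enrichment, which is consistent with the values near 1 in Table 3 but is not stated in the report. The function name and the counts in the example are hypothetical.

```python
from math import comb


def fisher_overrepresentation_p(k_obs, cluster_size, term_total, n_total):
    """One-sided Fisher's exact p-value P(X >= k_obs).

    X is the number of samples carrying the term among `cluster_size`
    samples drawn without replacement from `n_total` samples, of which
    `term_total` carry the term (hypergeometric distribution).
    """
    upper = min(cluster_size, term_total)
    tail = sum(
        comb(term_total, x) * comb(n_total - term_total, cluster_size - x)
        for x in range(k_obs, upper + 1)
    )
    return tail / comb(n_total, cluster_size)


# Hypothetical counts: 20 samples, 6 carry the term, the cluster holds
# 8 samples, and 5 of those carry the term.
p = fisher_overrepresentation_p(k_obs=5, cluster_size=8, term_total=6, n_total=20)
```

Because the tail is summed with exact integer arithmetic before the final division, the result does not suffer from rounding in the intermediate binomial coefficients.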
The matrix W, which contains the weight of each feature in each cluster, was used to derive a list of r representative features separating the clusters using the method proposed in (Kim and Park, 2007). To derive a p-value for each cluster-specific feature, a 2-sample moderated t-test (Ritchie et al., 2015) was used to compare the abundance of the feature between the respective cluster and all other clusters. Derived p-values were adjusted for multiple hypothesis testing using the method proposed in (Benjamini and Hochberg, 1995). Features with FDR < 0.01 (parameter feature_fdr in Table 4) were used in subsequent analyses.
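The multiple-testing adjustment can be illustrated with a short sketch of the Benjamini-Hochberg step-up procedure. The p-values are invented for the example, the function name is hypothetical, and the moderated t-test that would produce the real p-values is not reproduced here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal p-values for five features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(p_values)

# Keep features below an FDR threshold of 0.01.
selected = [i for i, q in enumerate(fdr) if q < 0.01]
```

In this toy input only the first feature passes the threshold, even though two nominal p-values are below 0.01, which shows why the adjustment matters for feature selection.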
Figure 3: Heatmap depicting abundances of cluster-specific features defined as described above. Samples are ordered by cluster and cluster membership score in decreasing order.
In total, 330 features separating the clusters were detected using the method described above. The distribution of features across the different clusters is shown in Figure 4.
Figure 4: Bar chart depicting the number of cluster-specific features.
The data table below depicts all cluster-specific features. The table is interactive and can be sorted and filtered. Please note that the table represents a condensed version of the entire table, which can be found in the Excel sheet NMF_features_N_330.xlsx
The entries in the sample-by-sample matrix shown in Figure 5 depict the relative frequencies with which two samples were assigned to the same cluster across 50 iterations.
Figure 5: Consensus matrix derived from 50 randomly initialized iterations.
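A consensus matrix of this kind can be sketched as follows. The four label vectors stand in for the cluster assignments of four runs and are hypothetical; the report uses 50 runs. The function name is also an assumption made for illustration.

```python
def consensus_matrix(runs):
    """Fraction of runs in which each pair of samples shares a cluster.

    `runs` is a list of label lists, one per run, all of equal length.
    Only co-membership matters, so label permutations between runs
    do not affect the result.
    """
    n_runs = len(runs)
    n_samples = len(runs[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        for i in range(n_samples):
            for j in range(n_samples):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


# Hypothetical assignments of 4 samples across 4 runs.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
C = consensus_matrix(runs)
```

Entries close to 0 or 1 indicate pairs that are consistently separated or consistently grouped, while intermediate values point to unstable assignments.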
Silhouette scores indicate how similar a sample is to its own cluster compared to other clusters. The silhouette plot shown in Figure 6 depicts the consistency of the derived clusters. A negative silhouette score marks a sample as an outlier in its respective cluster.
Figure 6: Silhouette plot illustrating the silhouette score (x-axis) for each sample (y-axis) grouped by cluster (k=3). The number of samples and the average silhouette score per cluster are shown on the right side.
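For readers unfamiliar with the score, the sketch below computes the standard silhouette value s = (b - a) / max(a, b) from a pairwise distance matrix, where a is the mean distance to the other members of the sample's own cluster and b is the smallest mean distance to any other cluster. Which distance the module uses is not stated in this report, so the one-dimensional toy data, the absolute-difference distance, and the function name are assumptions.

```python
def silhouette_scores(dist, labels):
    """Silhouette score per sample from a symmetric distance matrix.

    Assumes at least two clusters. Singleton clusters get a score of 0.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data: five points on a line; the last point is labeled as cluster 0
# although it lies closer to the members of cluster 1.
points = [0.0, 1.0, 10.0, 11.0, 6.0]
labels = [0, 0, 1, 1, 0]
dist = [[abs(p - q) for q in points] for p in points]
scores = silhouette_scores(dist, labels)
```

The last sample receives a negative score because its mean distance to its own cluster exceeds its mean distance to the other cluster, which is the outlier pattern described above.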
Details about the parameters listed in Table 4 can be found in the PANOPLY Wiki.
| param | value |
|---|---|
| kmin | 2 |
| kmax | 10 |
| exclude_2 | TRUE |
| core_membership | 0.5 |
| nrun | 50 |
| seed | random |
| method | brunet |
| bnmf | FALSE |
| feature_fdr | 0.01 |
| ora_pval | 0.01 |
| ora_max_categories | 10 |
| hm_cw | 5 |
| hm_ch | 8 |
| hm_max_val | 10 |
| hm_max_val_z | 4 |
| filt_mode | global |
| sd_filt | 0.05 |
| z_score | TRUE |
| impute | FALSE |
| impute_k | 5 |
| max_na_row | 0.3 |
| max_na_col | 0.9 |
| gene_col | geneSymbol |
| nmf_only | FALSE |
| organism | human |
| tar_file | /cromwell_root/fc-3c36c89e-bca9-4372-b58b-2a820ecb71ef/44eb133e-5a7b-4197-9ac0-f3d01be82d04/panoply_unified_workflow/4f1cfe26-53b7-4522-ad32-8082485723c9/call-nmf/mo_nmf_wdl.panoply_mo_nmf_gct_workflow/da0d8709-69d1-40f2-887f-aa617216d795/call-panoply_mo_nmf_pre/cacheCopy/all.tar |
| lib_dir | /home/pgdac/src/ |
| yaml_file | /cromwell_root/fc-3c36c89e-bca9-4372-b58b-2a820ecb71ef/panoply-parameters.yaml |
| help | FALSE |
| cat_anno | PAM50;ER.Status;PR.Status;HER2.Status;TP53.mutation;PIK3CA.mutation;GATA3.mutation |
| cont_anno | NA |
| cat_colors | PAM50=Her2:#F9BFCB;Basal:#EE2025;LumB:#ADDAE8;LumA:#3953A5\|ER.Status=Negative:#FFFFFF;Positive:#000000\|PR.Status=Negative:#FFFFFF;Positive:#000000\|HER2.Status=Positive:#000000;Negative:#FFFFFF;Equivocal:#808080\|TP53.mutation=1:#000000;0:#FFFFFF\|PIK3CA.mutation=0:#FFFFFF;1:#000000\|GATA3.mutation=0:#FFFFFF;1:#000000 |
| blank_anno | N/A |
| blank_anno_col | white |
Created on 2020-11-13 05:06:57